HCA-DAN: hierarchical class-aware domain adaptive network for gastric tumor segmentation in 3D CT images

Background: Accurate segmentation of gastric tumors from CT scans provides useful image information for guiding the diagnosis and treatment of gastric cancer. However, automated gastric tumor segmentation from 3D CT images faces several challenges. The large variation of anisotropic spatial resolution limits the ability of 3D convolutional neural networks (CNNs) to learn features from different views. The background texture of gastric tumors is complex, and their size, shape, and intensity distribution are highly variable, which makes it difficult for deep learning methods to capture the tumor boundary. In particular, while multi-center datasets increase sample size and representation ability, they suffer from inter-center heterogeneity.

Methods: In this study, we propose a new cross-center 3D tumor segmentation method named Hierarchical Class-Aware Domain Adaptive Network (HCA-DAN), which includes a new 3D neural network that efficiently bridges an anisotropic neural network and a Transformer (AsTr) to extract multi-scale context features from CT images with anisotropic resolution, and a hierarchical class-aware domain alignment (HCADA) module that adaptively aligns multi-scale context features across two domains by integrating a class attention map with class-specific information. We evaluate the proposed method on an in-house CT image dataset collected from four medical centers and validate its segmentation performance in both in-center and cross-center test scenarios.

Results: Our baseline segmentation network (i.e., AsTr) achieves the best results compared with other 3D segmentation models, with mean Dice similarity coefficients (DSC) of 59.26%, 55.97%, 48.83%, and 67.28% in the four in-center test tasks, and DSCs of 56.42%, 55.94%, 46.54%, and 60.62% in the four cross-center test tasks. In addition, the proposed cross-center segmentation network (i.e., HCA-DAN) obtains excellent results compared with other unsupervised domain adaptation methods, with DSCs of 58.36%, 56.72%, 49.25%, and 62.20% in the four cross-center test tasks.

Conclusions: Comprehensive experimental results demonstrate that the proposed method outperforms the compared methods on this multi-center database and is promising for routine clinical workflows.


Introduction
Image-guided disease diagnosis and treatment is an important part of the routine clinical workflow, particularly for gastric cancer, which is the third leading cause of cancer-related death worldwide [1]. Computed tomography (CT) is the most commonly used imaging modality for preoperative assessment of tumor status because it offers high density resolution, convenient inspection, fast acquisition, and non-invasiveness [2]. In clinical practice, imaging examination is usually performed manually by radiologists slice by slice [3], which is an expensive and time-consuming process that also relies heavily on the experience of radiologists. Automated segmentation of gastric tumors not only reduces the burden on radiologists but is also expected to complement conventional imaging tools. However, this segmentation task is challenging for the following reasons: (a) 3D CT images have anisotropic spatial resolution; (b) there is low contrast between the tumor and adjacent structures; and (c) the large samples needed to train robust models are often difficult to obtain from a single medical center.
Previous studies using CT images to characterize gastric cancer mainly addressed diagnostic tasks (e.g., estimating tumor invasion depth, predicting lymph node metastasis, and identifying occult peritoneal metastasis) [4][5][6][7], and these works usually performed task-specific predictions based on the region of interest (ROI) of the primary tumor. In earlier work, computer-aided diagnosis (CAD) methods for gastric cancer in CT images were mainly based on radiomics. For example, Wang et al. [8] explored the potential of a radiomics-based method for predicting the depth of tumor invasion in gastric cancer, performing tumor segmentation with dedicated post-processing software on enhanced CT images. Meng et al. [9] extracted 2D and 3D CT radiomic features from a multi-center dataset and comprehensively compared them for gastric cancer characterization and discrimination in three diagnostic tasks. Dong et al. [10] identified occult peritoneal metastasis in 554 gastric cancer patients from four centers; they first built radiomic signatures of the primary tumor and peritoneum based on 266 imaging features, and then combined the primary tumor, the peritoneum, and the Lauren type to predict occult peritoneal metastasis. All of the above studies are based on radiomics, which usually involves two stages: extracting ROI-based hand-crafted features and building traditional machine learning classifiers. Extracting radiomic features is a time-consuming feature engineering process that usually requires domain-specific expertise. Furthermore, the methods proposed in these works are not fully automatic and are poorly suited to multi-center data because of complex data distributions and heavy feature engineering.
With the rapid development of deep learning, CAD algorithms based on deep learning have achieved convincing performance in medical image analysis [11][12][13], particularly in abdominal CT image analysis [14][15][16][17]. Previous CNN-based methods were inevitably limited in modeling long-range dependencies because they ignore non-local correlations in images. Inspired by the success of Transformers in natural language processing (NLP) and computer vision (CV), Transformers are being widely used in medical image processing [18][19][20] as an alternative backbone to CNNs due to their ability to capture long-range dependencies. However, only a few deep learning-based CAD algorithms [21][22][23] have been proposed for automatic segmentation of gastric tumors from CT images, and these works are our own. In [21][22][23], we collected data from three medical centers to increase the sample size, but ignored the heterogeneity/shift among data from different sources. In medical image analysis, domain heterogeneity/shift is more prominent than in conventional data because of changes in scanning instruments and the diversity of hospital populations. Domain adaptation techniques are designed to reduce domain shift and improve model generalization in the test phase; when the data distribution gap between the source and target domains is narrowed, better generality can be obtained. For unsupervised domain adaptation (UDA), a typical scenario with data from two or more medical centers/sites assumes that the unlabeled data of one center is the target domain and the labeled data of the remaining center(s) is the source domain [24]. UDA algorithms narrow the domain discrepancy by strengthening information alignment at the feature level or the image level, thereby improving model performance on the unlabeled target domain. Feature-level alignment methods transform source and target data into latent spaces, aiming to discover domain-invariant features by performing distribution alignment; most adopt a Siamese architecture similar to the domain adversarial neural network (DANN) structure [25], which helps to obtain domain-invariant features. Image-level alignment methods are often used on paired data: they convert source images into target-like images and vice versa, helping segmentation models learn specific information in the target domain. For example, Zhang et al. [26] proposed DANN-based domain-symmetric networks to achieve feature distribution invariance at a finer category level; the network uses a symmetric design for the source and target task classifiers, and the authors additionally build a classifier that shares neurons with the task classifiers. Hoffman et al. [27] proposed Cycle-Consistent Adversarial Domain Adaptation (CyCADA), which adapts between domains using both generative image-space alignment and latent representation-space alignment. Inspired by this work, several studies have investigated domain adaptation of deep neural networks for medical image analysis tasks. For example, Kamnitsas et al. [28] developed an unsupervised domain adaptation method for brain lesion segmentation by studying adaptation between databases acquired with two different scanners and different MR imaging sequences. Yan et al. [29] proposed an adversarial-learning-based UDA method for cross-vendor medical image segmentation, in which a domain discriminator is co-trained with the segmentor to learn domain-invariant features for segmentation. Panfilov et al. [30] developed an unsupervised domain adaptive segmentation model based on adversarial learning for cross-device knee tissue segmentation, co-training a U-Net-based segmentor and a domain discriminator. However, these methods ignore class information during feature alignment, which results in misalignment.
In this paper, we propose a new hierarchical class-aware domain adaptive network (HCA-DAN) for gastric tumor segmentation in the cross-center scenario. To simultaneously handle the anisotropy of 3D data and the long-range dependencies in the extracted feature maps, we design a feature extraction backbone that efficiently bridges an anisotropic neural network and a Transformer (AsTr) to extract multi-scale features from CT images. In particular, we also design a pyramid boundary-aware (PBA) block that is placed at multiple levels of the decoding path. Furthermore, we propose a hierarchical class-aware domain alignment (HCADA) module, which not only considers tumor size in feature alignment but also incorporates a class attention map into the domain discriminator so that the feature alignment process pays more attention to class-specific information. In summary, our work has three main contributions:

Datasets and data pre-processing
This is a retrospective multi-center study with data from four medical centers. To cope with the memory consumption of 3D data, and considering that the tumor area is much smaller than the background area, we crop and resample each volume into patches with a voxel size of 5.0 × 0.741 × 0.741 mm³ or 8.0 × 0.741 × 0.741 mm³. To compensate for the limited training data, we not only use online data augmentation [12] (e.g., flipping, rotation, translation), but also perform CT intensity normalization (automatic clipping to the 0.5 to 99.5% intensity range of all foreground voxels) and voxel spacing resampling (with third-order spline interpolation).
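As a concrete illustration, the following Python sketch reproduces this pre-processing pipeline. The function name, the z-score normalization step, and the use of scipy are our own assumptions; only the clipping percentiles, the target spacings, and the spline order come from the text.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_case(volume, mask, spacing, target_spacing=(5.0, 0.741, 0.741)):
    """Clip intensities to the 0.5-99.5 percentile range of foreground voxels,
    normalize, and resample to the target voxel spacing (a minimal sketch)."""
    # Percentile clipping is computed on foreground voxels only (mask > 0)
    foreground = volume[mask > 0]
    lo, hi = np.percentile(foreground, 0.5), np.percentile(foreground, 99.5)
    volume = np.clip(volume, lo, hi).astype(np.float32)
    volume = (volume - volume.mean()) / (volume.std() + 1e-8)

    # Third-order spline interpolation for the image, nearest-neighbor for the mask
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    return zoom(volume, factors, order=3), zoom(mask, factors, order=0)
```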

Network overview
Figure 1 shows an overview of the proposed HCA-DAN, which comprises two collaborative components, i.e., AsTr and HCADA. The proposed 3D domain adaptation network takes an abdominal CT volume as input and uses AsTr as the backbone to extract multi-scale context features from CT images with anisotropic resolution. The extracted features from the source and target domains are then passed to the HCADA module, which effectively distinguishes source-domain from target-domain features while taking class information into account.

Architecture of AsTr
Inspired by CoTr [18], AsTr is proposed to learn more discriminative multi-scale features for gastric tumor segmentation by joining a CNN and a Transformer. AsTr consists of an anisotropic convolutional encoder (As-encoder) for feature extraction from CT images with anisotropic resolution, a deformable Transformer encoder (DeTrans-encoder) for long-range dependency modeling, and an anisotropic convolutional decoder (As-decoder) for accurate tumor segmentation.
To address the issue of anisotropic voxel resolution, we construct the As-encoder by combining anisotropic convolutions with isotropic convolutions, rather than simply using isotropic convolutions. The As-encoder mainly contains a Conv-GN-PReLU block, two average pooling layers, two stages of anisotropic convolution blocks (AsBlocks), and two stages of 3D squeeze-and-excitation residual (SE-Res) blocks. The Conv-GN-PReLU block is a 3D convolutional layer followed by group normalization (GN) and a parametric rectified linear unit (PReLU). The numbers of AsBlocks in the two stages are two and three, respectively; the numbers of SE-Res blocks are three and two, respectively. As shown in Fig. 2a, the input of an AsBlock is fed to 1 × 3 × 3 and 3 × 1 × 1 anisotropic convolutions in parallel, and the outcomes are concatenated with the input to form the output. Moreover, 1 × 1 × 1 convolutions are applied to both the input and the output to adjust the channel numbers of the features. Through this design, the As-encoder can independently extract features on the x-y plane and along the z direction of the 3D volume, which reduces the influence of anisotropic spatial resolution. Considering that 3D data contain a wealth of information, we add two stages of SE-Res blocks at the back end of the As-encoder. As shown in Fig. 2b, the SE-Res block consists of residual and SE blocks, which not only improves the representation capability of the encoder but also alleviates the overfitting caused by a deep network.
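A minimal PyTorch sketch of the AsBlock described above is given below. The exact placement of the activation (and any normalization) inside the branches is an assumption, since the text specifies only the convolution layout.

```python
import torch
import torch.nn as nn

class AsBlock(nn.Module):
    """Sketch of the anisotropic convolution block: parallel 1x3x3 (in-plane)
    and 3x1x1 (through-plane) convolutions whose outputs are concatenated with
    the input; 1x1x1 convolutions adjust the channel numbers."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.proj_in = nn.Conv3d(in_channels, out_channels, kernel_size=1)
        self.in_plane = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.through_plane = nn.Conv3d(out_channels, out_channels,
                                       kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.proj_out = nn.Conv3d(3 * out_channels, out_channels, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj_in(x)
        xy = self.act(self.in_plane(x))       # features on the x-y plane
        z = self.act(self.through_plane(x))   # features along the z direction
        return self.proj_out(torch.cat([x, xy, z], dim=1))
```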
To compensate for the inherent locality of the convolution operation, the DeTrans layer [18] is used to capture the long-range dependencies of pixels in the multi-scale features generated by the encoder. The DeTrans layer is composed of a multi-scale deformable self-attention (MS-DMSA) layer and a feed-forward network, each followed by layer normalization.
To capture more accurate tumor boundaries, in addition to the AsBlock and SE-Res blocks, we also design the PBA block in the As-decoder. The As-decoder therefore mainly contains two stages of AsBlock, two stages of 3D SE-Res block, four PBA blocks, four transposed convolution layers, and a Conv-GN-PReLU block. Inspired by the 2D pyramid edge extraction module [31], we design the 3D PBA block (Fig. 2c) to refine the boundaries of the lesion. The PBA block is a simple and effective pyramid boundary information extraction strategy that obtains robust boundary information by capturing multi-granularity responses near the edge. The refined feature $F_{out}$ is generated by a series of operations defined as:

$$F_{out} = \mathrm{conv}\Big(\mathcal{C}\big(\sigma(F - F_1) \otimes F,\; \ldots,\; \sigma(F - F_n) \otimes F\big)\Big),$$

where $F$ is the input feature map, $F_n$ is obtained by average pooling layers with different kernel sizes, conv is a 1 × 1 × 1 convolutional layer, $\mathcal{C}$ represents the channel concatenation operation, σ denotes a Sigmoid function, and ⊗ indicates element-by-element multiplication. In this way, we obtain multi-granularity responses near the edge by subtracting the average-pooled values with different kernel sizes from the local convolutional feature maps and applying a soft attention operation in each branch. It is worth noting that, during decoding, the output sequence of the DeTrans layers is reshaped into feature maps according to the size at each scale; the reshaped multi-scale features are then added element by element in the decoding path for better tumor segmentation.
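The following PyTorch sketch instantiates the PBA equation above with two pooling branches (the paper uses kernels of 3 and 5 in the first two blocks and 5 and 7 in the last two). It is an illustrative reconstruction under those assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PBABlock(nn.Module):
    """Sketch of the 3D PBA block: each branch subtracts an average-pooled map
    F_n from F, turns the residual into a sigmoid soft-attention map,
    re-weights F, and a 1x1x1 convolution fuses the concatenated branches."""
    def __init__(self, channels: int, pool_sizes=(3, 5)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.AvgPool3d(k, stride=1, padding=k // 2) for k in pool_sizes)
        self.fuse = nn.Conv3d(len(pool_sizes) * channels, channels, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        branches = [torch.sigmoid(f - pool(f)) * f  # multi-granularity edge attention
                    for pool in self.pools]
        return self.fuse(torch.cat(branches, dim=1))
```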

Hierarchical class-aware domain alignment
In this section, we consider how to use class-specific information to guide multi-scale feature distribution alignment in our feature extractor AsTr. On the one hand, tumors in different cases have different sizes and positions in CT images, and multi-scale feature extraction has proved very effective in many scenarios, especially in lesion segmentation. Technically, low-resolution feature maps tend to predict large objects, while high-resolution feature maps tend to predict small objects. Therefore, we introduce a hierarchical domain alignment mechanism that roughly takes object scale into account when performing domain distribution alignment. In short, we configure a domain discriminator for each scale of features, which effectively guides the feature alignment for tumors of different sizes. On the other hand, many methods ignore class-specific knowledge during feature alignment, which leads to misalignment. To encourage a more discriminative distribution alignment, we produce an attention map for each class separately, calculated from the probability of class occurrence. The attention map is defined as:

$$A = \mathrm{Softmax}(F_{out}),$$

where $F_{out}$ denotes the output of the segmentation network AsTr. In other words, we use the Softmax function to calculate the class attention map at all output spatial positions. This class attention map is aggregated into the domain discriminator so that domain adaptation captures class-specific rather than class-agnostic information, which encourages a more discriminative distribution alignment in the CADA block. Specifically, we employ the U-Net [32] architecture as the domain discriminator D in each CADA block. First, we upsample the feature generated by the PBA block with trilinear interpolation to the same resolution as the input image. The newly generated feature is then fed into the domain discriminator D, which produces a probability map distinguishing whether the feature comes from the source or the target domain. Finally, this probability map is multiplied element by element with the class attention map to obtain the final probability map.
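A sketch of one CADA block forward pass is shown below. We assume the discriminator outputs a single-channel domain probability map that broadcasts against the multi-channel class attention map; these shapes, and the function name, are our assumptions.

```python
import torch
import torch.nn.functional as F

def cada_forward(discriminator, pba_feature, seg_logits, input_size):
    """Sketch of one CADA block: upsample a PBA feature to input resolution,
    predict a per-voxel domain probability map with a U-Net discriminator,
    and weight it by the class attention map A = Softmax(F_out)."""
    feature = F.interpolate(pba_feature, size=input_size,
                            mode='trilinear', align_corners=False)
    domain_prob = discriminator(feature)                 # source vs. target score
    class_attention = torch.softmax(seg_logits, dim=1)   # per-class attention map
    # Element-wise product focuses the alignment on class-specific regions
    return domain_prob * class_attention
```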

Data partitioning and network implementation
We validate the proposed method in both in-center and cross-center test scenarios. To obtain reliable segmentation results, we employ a five-fold group cross-validation strategy in the in-center test scenario.
In the cross-center test scenario, we use three datasets as the source domain and the remaining one as the target domain, which is a common validation strategy for domain adaptive methods. The proposed cross-center 3D tumor segmentation method is implemented on the PyTorch platform and trained on one NVIDIA GeForce RTX 3090 GPU (24 GB). We train all 3D networks with the SGD optimizer, a momentum of 0.99, and an initial learning rate of 1 × 10⁻³. We set the batch size to 2 and train for 500 epochs, each containing 250 iterations. In the four PBA blocks, we use 3 × 3 × 3 and 5 × 5 × 5 average pooling kernels for the first two blocks, and 5 × 5 × 5 and 7 × 7 × 7 pooling kernels for the last two blocks.
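In code, the stated configuration amounts to the following; only the optimizer settings, batch size, and schedule lengths come from the text, and the helper name is illustrative.

```python
import torch

def configure_training(model: torch.nn.Module):
    """Training configuration sketch: SGD with momentum 0.99 and an initial
    learning rate of 1e-3; 500 epochs of 250 iterations with batch size 2."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)
    epochs, iters_per_epoch, batch_size = 500, 250, 2
    return optimizer, epochs, iters_per_epoch, batch_size
```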
We employ four performance metrics to quantitatively evaluate the segmentation results: the Dice similarity coefficient (DSC), the Jaccard index (JI), the average surface distance (ASD, in mm), and the 95% Hausdorff distance (95HD, in mm). The first two are more sensitive to the inner filling of the mask, and the last two are more sensitive to the segmentation boundary. These metrics are calculated as follows:

$$DSC = \frac{2|X \cap Y|}{|X| + |Y|}, \qquad JI = \frac{|X \cap Y|}{|X \cup Y|},$$

$$ASD = \mathrm{mean}\Big\{\operatorname*{mean}_{x \in X}\min_{y \in Y} d(x, y),\; \operatorname*{mean}_{y \in Y}\min_{x \in X} d(x, y)\Big\},$$

$$HD = \max\Big\{\max_{x \in X}\min_{y \in Y} d(x, y),\; \max_{y \in Y}\min_{x \in X} d(x, y)\Big\},$$

where |·| and ∩ denote the size and intersection operations on sets, X and Y are the point sets of the prediction and the ground truth, and d(x, y) is the distance between points x and y; $\operatorname{mean}_{x\in X}\min_{y\in Y}$ averages the closest distances between the two point sets, and $\max_{x\in X}\min_{y\in Y}$ is the largest of the closest distances from one point set to the other. 95HD is analogous to the maximum HD but is based on the 95th percentile of the distances between the boundary points in X and Y.
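For reference, the sketch below computes the four metrics from binary masks and boundary point sets, directly following the formulas above. It is a brute-force implementation for illustration; dedicated evaluation toolkits are typically used in practice.

```python
import numpy as np

def dsc_ji(pred: np.ndarray, gt: np.ndarray):
    """DSC and JI for binary masks, following the formulas above."""
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum())
    ji = inter / np.logical_or(pred, gt).sum()
    return dsc, ji

def asd_hd95(points_x: np.ndarray, points_y: np.ndarray):
    """ASD and 95HD from two (N, 3) arrays of boundary-point coordinates.
    Brute-force pairwise distances; fine for a sketch, slow for large sets."""
    d = np.linalg.norm(points_x[:, None, :] - points_y[None, :, :], axis=-1)
    d_xy, d_yx = d.min(axis=1), d.min(axis=0)   # closest distances each way
    asd = (d_xy.mean() + d_yx.mean()) / 2.0
    hd95 = max(np.percentile(d_xy, 95), np.percentile(d_yx, 95))
    return asd, hd95
```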

Loss function
We employ adversarial strategies for network training. The proposed network therefore involves three losses: a segmentation loss $L_{seg}$, a hierarchical discrimination loss $L^{h}_{dis}$, and a hierarchical adversarial domain adaptation loss $L^{h}_{da}$. The segmentation loss is the sum of the Dice loss $L_{dice}$ and the binary cross-entropy loss $L_{bce}$, defined as:

$$L_{seg} = L_{dice} + L_{bce},$$

$$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i}, \qquad L_{bce} = -\frac{1}{N}\sum_{i=1}^{N}\big[g_i \log p_i + (1 - g_i)\log(1 - p_i)\big],$$

where N is the voxel number of the input CT volume, $p_i \in [0.0, 1.0]$ is the predicted probability at voxel i, and $g_i \in \{0, 1\}$ is the corresponding voxel value of the binary ground-truth volume.
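The segmentation loss can be sketched in PyTorch as follows; the smoothing constant eps is an implementation detail we add for numerical stability, not part of the stated formula.

```python
import torch
import torch.nn as nn

class SegmentationLoss(nn.Module):
    """Sketch of L_seg = L_dice + L_bce as defined above."""
    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps          # smoothing term (our addition)
        self.bce = nn.BCELoss()

    def forward(self, p: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
        # p: predicted probabilities in [0, 1]; g: binary ground truth (float)
        intersection = (p * g).sum()
        dice = 1.0 - (2.0 * intersection + self.eps) / (p.sum() + g.sum() + self.eps)
        return dice + self.bce(p, g)
```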
Following [33], we calculate the single-level discrimination and adversarial domain adaptation losses with the least-squares loss function as follows:

$$L^{l}_{dis} = \big(D_l(f^{l}_{PBA}(F^{l}_{s})) - 1\big)^2 + \big(D_l(f^{l}_{PBA}(F^{l}_{t}))\big)^2,$$

$$L^{l}_{da} = \big(D_l(f^{l}_{PBA}(F^{l}_{t})) - 1\big)^2,$$

where $f^{l}_{PBA}$ denotes the l-th PBA block, $l \in \{1, 2, 3, 4\}$, and $F^{l}_{s}$ and $F^{l}_{t}$ represent the source-domain and target-domain features obtained in the layer before the l-th PBA block, respectively. The hierarchical discrimination and adversarial domain adaptation losses are then defined as:

$$L^{h}_{dis} = \sum_{l=1}^{4} \lambda_l L^{l}_{dis}, \qquad L^{h}_{da} = \sum_{l=1}^{4} \lambda_l L^{l}_{da},$$

where $\lambda_l$ denotes the weight of the l-th discrimination and adversarial domain adaptation losses, which decreases exponentially with the decrease of feature resolution.
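The least-squares losses above translate into the following sketch, where the source-is-1/target-is-0 label convention is our assumption.

```python
import torch

def dis_loss(d_src: torch.Tensor, d_tgt: torch.Tensor) -> torch.Tensor:
    """Single-level least-squares discrimination loss: the discriminator
    pushes source predictions toward 1 and target predictions toward 0."""
    return ((d_src - 1.0) ** 2).mean() + (d_tgt ** 2).mean()

def da_loss(d_tgt: torch.Tensor) -> torch.Tensor:
    """Single-level adversarial loss: the segmentor tries to make target
    features indistinguishable from source features."""
    return ((d_tgt - 1.0) ** 2).mean()

def hierarchical_loss(level_losses, level_weights):
    """L^h = sum_l lambda_l * L^l, with weights decaying as resolution drops."""
    return sum(w * l for w, l in zip(level_weights, level_losses))
```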

Comparison with the state-of-the-art segmentation methods
To confirm the efficacy of the proposed AsTr, we compared it with six baseline/state-of-the-art (SOTA) medical image segmentation methods: V-Net [34], 3D FPN [35], nnU-Net [12], CoTr [18], UNETR [19], and Swin-Unet [20]. V-Net is designed for 3D volume segmentation and is widely used for 3D medical image data. 3D FPN is an effective multi-scale feature extractor and serves as a backbone in many works. nnU-Net is a robust segmentation method that has achieved good results in many medical image segmentation tasks. CoTr efficiently and effectively bridges CNN and Transformer for 3D medical image segmentation. UNETR consists of a Transformer encoder that directly processes 3D patches and is connected to a CNN-based decoder via skip connections. Swin-Unet is a pure Transformer-based U-shaped encoder-decoder network. We compare the first four methods in the in-center test scenario and all methods in the cross-center test scenario. Tables 1 and 3 list the segmentation results of these methods and the proposed method in the in-center and cross-center test scenarios, respectively. Compared with the other segmentation networks, the proposed AsTr achieves the best overall performance. As shown in Table 2, our method is well ahead of these baseline methods in the four in-center test scenarios (p < 0.05), significantly outperforming nnU-Net in internal validation on D2 and D3 and rivaling nnU-Net on D1 and D4.
To confirm the efficacy of the proposed HCA-DAN, we compared it with three feature-level domain alignment methods: Kamnitsas et al. [28], Yan et al. [29], and Panfilov et al. [30], which for simplicity we denote UDA1, UDA2, and UDA3, respectively. These methods are similar to the proposed HCA-DAN in that they train a segmentor and one or more domain discriminators in an end-to-end manner. Table 4 lists the segmentation results of these UDA methods in the cross-center test scenario.

Ablation study
To demonstrate the effectiveness of the proposed method for gastric tumor segmentation, we conducted two groups of ablation experiments.

Effectiveness of the PBA block
In medical image segmentation, accurately delineating the lesion/object boundary is very important. As shown in Fig. 3, we use bar graphs to plot the segmentation results of AsTr with and without the PBA block. Adding PBA blocks to the decoding path further improves segmentation performance: although the PBA block brings modest performance gains, it steadily refines prediction boundaries in all four cross-center test scenarios. In Fig. 4, we also visualize 2D axial views of some segmentation results, which show that the proposed method predicts lesion boundaries closer to the ground truth and confirm that PBA blocks help with boundary refinement.

Effectiveness of the HCADA module
The core of the HCADA module is to consider tumor size and class-specific information during feature alignment to improve the segmentation performance of the segmentation network AsTr in the cross-center test scenario. We therefore conduct comparative experiments that consider only one of these two factors at a time. Our approach can automatically characterize gastric cancer and provide whole-tumor segmentation, which helps determine appropriate surgical approaches and predict prognosis. Although our approach outperforms other segmentation methods, there is still room for improvement in this tumor segmentation task, for which we see two possible reasons. On the one hand, the voxel spacing of the data limits segmentation performance; on the other hand, this small-object segmentation task suffers from interference from the background area. Therefore, our future research should not only address the heterogeneity among multi-center data but also pursue higher tumor segmentation performance through two-stage modeling. A two-stage strategy is more consistent with the clinical workflow, in which the clinician first roughly determines the ROI and subsequently performs detailed lesion delineation.
To fully explore the performance of the different models, we report the number of FLOPs, the parameter counts, and the average inference time of the models in Table 6. AsTr has the second-lowest average inference time after 3D FPN and is significantly faster than the other models.
In addition, dataset D3 is special among our four datasets: its inter-slice voxel spacing is 8 mm, unlike the other three datasets. To explore this effect, we also set up a cross-center experiment without dataset D3. Table 7 lists the segmentation results of the different segmentation methods. Compared with the results in Table 3, the results decreased in all three cross-center test scenarios, indicating that the amount of data was more important than data quality in our cross-center gastric tumor segmentation scenarios. We will therefore collect and study data from more centers in the future.
We exploit the different lesion sizes and class-specific information in the 3D representation. Extensive experiments under four test scenarios, together with a comprehensive ablation study and analysis, demonstrate the effectiveness of our approach for cross-center 3D gastric tumor segmentation.
Although domain adaptation technology can effectively handle domain shift, domain adaptation-based methods require images from the target domain (labeled or unlabeled) for model training or retraining. In real-world scenarios, it is time-consuming or even impractical to collect data from each new target domain to fine-tune the model before deploying it. In future work, we will employ domain generalization technology to address the domain shift problem in multi-center studies. Domain generalization aims to learn a model from one or multiple source domains that generalizes directly to unseen target domains, which facilitates the widespread use and effective deployment of intelligent analysis models in the clinic.

Fig. 1
Fig. 1 Overview of the proposed HCA-DAN. AsBlock: anisotropic convolution block; SE-Res: squeeze-and-excitation residual block; PBA: pyramid boundary-aware block; HCADA: hierarchical class-aware domain alignment module, which includes four CADA blocks. Note that, for clarity, we omit the positional encoding applied when the multi-scale features generated by the As-encoder are passed to the DeTrans layer


Fig. 2
Fig. 2 The architectures of the three blocks: (a) AsBlock, (b) SE-Res block, and (c) PBA block

The number of FLOPs and the inference time are calculated based on an input size of 28 × 256 × 256. The proposed AsTr is a relatively small model with 18.67 M parameters and 388.09 GFLOPs. For comparison, other Transformer-based methods such as CoTr, UNETR, and Swin-Unet have 41.27 M, 145.85 M, and 102.81 M parameters and 670.62 G, 2201.41 G, and 1582.56 GFLOPs, respectively; AsTr thus shows comparable model complexity and is significantly more compact than similar models. The CNN-based segmentation models V-Net, 3D FPN, and nnU-Net have 45.60 M, 7.83 M, and 44.80 M parameters and 676.23 G, 56.71 G, and 691.17 GFLOPs, respectively. Compared to these methods, AsTr has the second-lowest parameter count and FLOPs.

Fig. 3
Fig. 3 The DSC values obtained by the proposed AsTr in four cross-center test scenarios with or without the help of the PBA block

Table 1
Segmentation results of different methods in the in-center test scenario

Table 2
The p-values of the paired t-test between the proposed AsTr and other methods in terms of DSC.

Table 3
Segmentation results of different methods in the cross-center test scenario

Table 4
Segmentation results of different UDA methods in the cross-center test scenario